Final Project - Countries categorization

A Jupyter notebook showing a Deep Learning problem description, EDA procedure, analysis (model building and training), result, and discussion/conclusion.

Problem

The goal is to advise the CEO of HELP International on which countries are in the direst need of aid. The task is to categorize the countries using socio-economic and health factors that determine each country's overall development. For this problem, I will use the K-means algorithm to divide the data into categories.

Dataset

The dataset comes from Kaggle (https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data). There are 167 rows and 10 columns.

  1. country: Name of the country
  2. child_mort: Death of children under 5 years of age per 1000 live births
  3. exports: Exports of goods and services per capita. Given as a percentage of the GDP per capita
  4. health: Total health spending per capita. Given as a percentage of the GDP per capita
  5. imports: Imports of goods and services per capita. Given as a percentage of the GDP per capita
  6. income: Net income per person
  7. inflation: The measurement of the annual growth rate of the Total GDP
  8. life_expec: The average number of years a newborn child would live if the current mortality patterns were to remain the same
  9. total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
  10. gdp: The GDP per capita. Calculated as the Total GDP divided by the total population.

Load libraries

1. Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data

1.1. Data import and basic inspection

Importing the data from the Country-data CSV file and inspecting its content.
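A minimal sketch of this step, assuming the file is read with pandas. The notebook would point `read_csv` at the downloaded Kaggle file, whose path is not given here. To keep the example self-contained, a three-row inline sample with made-up values and only four of the ten columns stands in for it.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle CSV (illustrative values, 4 of the 10 columns).
# In the notebook this would be pd.read_csv(<path to the downloaded file>).
sample_csv = io.StringIO(
    "country,child_mort,income,life_expec\n"
    "Country A,90.2,1610,56.2\n"
    "Country B,16.6,9930,76.3\n"
    "Country C,4.8,41400,82.0\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)       # (rows, columns); the full dataset has 167 rows and 10 columns
print(df.dtypes)      # country is text, the other columns are numeric
print(df.head())      # first rows
print(df.describe())  # summary statistics of the numeric columns
```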

1.2. Inspect null values

Checking whether any of the features contain null values.
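One common way to run this check, sketched on a made-up frame with one deliberately missing value (the values are illustrative and say nothing about the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame with one deliberately missing value (not the real data).
df = pd.DataFrame({
    "child_mort": [90.2, 16.6, 4.8],
    "income": [1610.0, np.nan, 41400.0],
})

null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)
print("Any nulls:", bool(null_counts.any()))
```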

1.3. Dropping columns

Dropping the country column because it is not needed for the modeling.
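A small sketch of this step on made-up rows. Keeping the country names in a separate variable is an assumption about how the clusters are mapped back to countries later; the source does not show that part.

```python
import pandas as pd

# Made-up rows (not the real data).
df = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "child_mort": [90.2, 16.6],
    "income": [1610, 9930],
})

countries = df["country"]                # kept aside to label clusters later
features = df.drop(columns=["country"])  # numeric columns only
print(features.columns.tolist())
```

K-means works on numeric distances, so the text column has to be excluded from the model input even though it is still needed to interpret the result.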

1.4. Correlation matrix

Identifying which variables are most strongly correlated with one another.
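A sketch of how the strongest correlations could be read off numerically rather than from a heatmap. The values are made up and only three columns are used, so the numbers printed here are not results for the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; only the mechanics matter here.
features = pd.DataFrame({
    "child_mort": [90.2, 16.6, 4.8, 62.0, 10.3],
    "life_expec": [56.2, 76.3, 82.0, 60.1, 75.8],
    "income": [1610, 9930, 41400, 5900, 19100],
})

corr = features.corr()  # Pearson correlation by default

# Keep each pair once (upper triangle, no diagonal) and rank by |r|.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```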

1.5. Pairplot

Exploring the pairwise relationships between the variables, which can be continuous or categorical.

2. Model building, training and result

2.1. Standardization

To put the features on the same scale, the data is standardized. This makes variables measured in different units directly comparable, which matters because K-means groups points by distance.
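Standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below assumes scikit-learn's `StandardScaler` is the tool used (the source does not name it) and runs on made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales (not the real data).
features = pd.DataFrame({
    "child_mort": [90.2, 16.6, 4.8, 62.0],
    "income": [1610, 9930, 41400, 5900],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# Each column now has mean 0 and population standard deviation 1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Without this step, a column such as income, whose values run into the tens of thousands, would dominate the distance calculation.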

2.2. K-means

For this problem, the K-means algorithm is used because it is a widely used clustering algorithm that partitions data into a chosen number of groups.

2.2.1. Finding the optimal number of clusters

To find the optimal number of clusters, I will use the elbow and silhouette methods.

2.2.2. Using the optimal cluster number

Performing K-means with 4 clusters, the number indicated by the silhouette method.

2.3. Results

Once the countries have been clustered, the results are visualized geographically.

Most of the underdeveloped countries appear to be on the African continent.

Some of the underdeveloped countries appear to be on the Asian continent.

List of underdeveloped countries
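A minimal sketch of how such a list could be produced from a 4-cluster K-means fit. The data is synthetic (four made-up groups of ten "countries" with two features), and the rule for picking the neediest cluster, highest mean child mortality, is an assumption for illustration; the source does not state how the cluster was chosen.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 synthetic groups of 10 "countries" each (not the real dataset).
rng = np.random.default_rng(0)
centers = [(90, 1500), (50, 6000), (20, 15000), (5, 40000)]  # (child_mort, income)
rows = []
for mort, inc in centers:
    for _ in range(10):
        rows.append((mort * rng.uniform(0.9, 1.1), inc * rng.uniform(0.9, 1.1)))
df = pd.DataFrame(rows, columns=["child_mort", "income"])
df.insert(0, "country", [f"Country {i}" for i in range(len(df))])

# Standardize, then cluster into 4 groups.
X = StandardScaler().fit_transform(df[["child_mort", "income"]])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)

# Assumption: the cluster with the highest mean child mortality is the one most in need.
neediest = df.groupby("cluster")["child_mort"].mean().idxmax()
in_need = df.loc[df["cluster"] == neediest, "country"].tolist()
print(in_need)
```

Cluster labels returned by K-means are arbitrary integers, so the neediest group has to be identified from the cluster statistics rather than from the label itself.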

3. Conclusion

I can conclude that the countries most in need of aid are on the African continent, with some on the Asian continent. To get this result, I used the K-means algorithm because it is a widely used clustering algorithm that partitions data into a chosen number of groups. Before running it, I used the elbow and silhouette methods to find the optimal number of clusters. The elbow method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The silhouette method provides a succinct graphical representation of how well each object fits its assigned cluster. Both methods were used because they provide complementary information for clustering analysis. For this data, both methods indicated the same number of clusters.
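The two selection methods described above can be sketched as follows. This is an illustration on synthetic data (four well-separated blobs at hand-picked centers), not a reproduction of the notebook's run. It uses K-means inertia (within-cluster sum of squares) as the elbow quantity, which falls as explained variation rises, and the mean silhouette score from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized features: 4 well-separated blobs.
blob_centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=200, centers=blob_centers, cluster_std=0.8, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                         # elbow curve: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)  # mean silhouette, in [-1, 1]

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("Best k by silhouette:", best_k)
```

Plotting `inertias` against k gives the elbow curve. The silhouette score gives a single number per k, which makes the choice less dependent on judging the bend by eye.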